The goal of this project is to predict customer default in the banking sector using a dataset on 700 past and 150 prospective bank customers. The dataset covers features such as age, education level, employment duration, address duration, income, debt-related metrics, and a binary default status (Yes or No).
The project explores the relationships between these features and customer default, employing both data visualization and statistical modeling techniques. Through the analysis, I intend to identify the factors influencing default rates and to develop predictive models that improve our understanding and forecasting of default in the banking context.
library(foreign)
library(dplyr)
library(ggplot2)
library(plotly)
library(ggcorrplot)
library(xgboost)
library(margins)
path <- "D:/Studies/Materials/Second-cycle/I year/I semester/Coding for DS and DM/R/r-project"
data <- read.spss(paste0(path, "/bankloan.sav"), to.data.frame = TRUE) %>%
select(-matches("preddef"))
head(data)
## age ed employ address income debtinc creddebt
## 1 41 Some college 17 12 176 9.3 11.359392
## 2 27 Did not complete high school 10 6 31 17.3 1.362202
## 3 40 Did not complete high school 15 14 55 5.5 0.856075
## 4 41 Did not complete high school 15 14 120 2.9 2.658720
## 5 24 High school degree 2 0 28 17.3 1.787436
## 6 41 High school degree 5 5 25 10.2 0.392700
## othdebt default
## 1 5.008608 Yes
## 2 4.000798 No
## 3 2.168925 No
## 4 0.821280 No
## 5 3.056564 Yes
## 6 2.157300 No
The bankloan.sav dataset, provided by IBM, contains information on 700 past and 150 prospective bank customers. The variables of interest include:
age: Age in years,
ed: Level of education,
employ: Number of years with the current employer,
address: Number of years at the current address,
income: Household income in thousands,
debtinc: Debt-to-income ratio in percentage,
creddebt: Credit card debt in thousands,
othdebt: Other debt in thousands,
default: Default status (“Yes” or “No”).
str(data)
## 'data.frame': 850 obs. of 9 variables:
## $ age : num 41 27 40 41 24 41 39 43 24 36 ...
## $ ed : Factor w/ 5 levels "Did not complete high school",..: 3 1 1 1 2 2 1 1 1 1 ...
## $ employ : num 17 10 15 15 2 5 20 12 3 0 ...
## $ address : num 12 6 14 14 0 5 9 11 4 13 ...
## $ income : num 176 31 55 120 28 25 67 38 19 25 ...
## $ debtinc : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ creddebt: num 11.359 1.362 0.856 2.659 1.787 ...
## $ othdebt : num 5.009 4.001 2.169 0.821 3.057 ...
## $ default : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 1 1 2 1 ...
## - attr(*, "variable.labels")= Named chr [1:12] "Age in years" "Level of education" "Years with current employer" "Years at current address" ...
## ..- attr(*, "names")= chr [1:12] "age" "ed" "employ" "address" ...
## - attr(*, "codepage")= int 65001
The dataset contains 7 numeric and 2 factor variables.
summary(data)
## age ed employ
## Min. :20.00 Did not complete high school:460 Min. : 0.000
## 1st Qu.:29.00 High school degree :235 1st Qu.: 3.000
## Median :34.00 Some college :101 Median : 7.000
## Mean :35.03 College degree : 49 Mean : 8.566
## 3rd Qu.:41.00 Post-undergraduate degree : 5 3rd Qu.:13.000
## Max. :56.00 Max. :33.000
## address income debtinc creddebt
## Min. : 0.000 Min. : 13.00 Min. : 0.10 Min. : 0.0117
## 1st Qu.: 3.000 1st Qu.: 24.00 1st Qu.: 5.10 1st Qu.: 0.3822
## Median : 7.000 Median : 35.00 Median : 8.70 Median : 0.8851
## Mean : 8.372 Mean : 46.68 Mean :10.17 Mean : 1.5768
## 3rd Qu.:12.000 3rd Qu.: 55.75 3rd Qu.:13.80 3rd Qu.: 1.8984
## Max. :34.000 Max. :446.00 Max. :41.30 Max. :20.5613
## othdebt default
## Min. : 0.04558 No :517
## 1st Qu.: 1.04594 Yes :183
## Median : 2.00324 NA's:150
## Mean : 3.07879
## 3rd Qu.: 3.90300
## Max. :35.19750
Customer age ranges from 20 to 56 years, with a mean of approximately 35 years. The majority (460 of 850) did not complete high school. Employment duration spans 0 to 33 years, averaging 8.57 years.
Address duration ranges from 0 to 34 years, with a mean of approximately 8.37 years. Household income varies from $13,000 to $446,000, averaging $46,680. The debt-to-income ratio ranges from 0.10% to 41.30%, with a mean of 10.17%. Credit card debt ranges from $11.70 to $20,561.30, averaging $1,576.80. Other debt ranges from $45.58 to $35,197.50, averaging $3,078.79.
Regarding default status, 517 customers have not defaulted, 183 have defaulted, and the 150 missing values correspond to the prospective customers.
ggplotly(
ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
aes(x = age, fill = default)) +
geom_density(alpha = 0.5) +
labs(title = "Age Distribution by Default Status"))
ggplotly(
ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
aes(x = employ, fill = default)) +
geom_density(alpha = 0.5) +
labs(title = "Employment Duration Distribution by Default Status"))
ggplotly(
ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
aes(x = debtinc, fill = default)) +
geom_density(alpha = 0.5) +
labs(title = "Debt-to-Income Ratio Distribution by Default Status"))
ggplotly(
ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
aes(x = creddebt, fill = default)) +
geom_density(alpha = 0.5) +
labs(title = "Credit Card Debt Distribution by Default Status"))
The density plots suggest that younger customers and those with shorter tenure at their current employer are more likely to default, while a lower debt-to-income ratio appears to mitigate default risk. Credit card debt looks less pivotal: on its own, it may not separate default from non-default cases.
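The four separate density plots above could also be drawn as a single faceted figure, which makes the comparison across variables easier. A minimal sketch, assuming tidyr is installed alongside the packages already loaded:

```r
# Sketch: all four densities in one faceted figure (assumes tidyr is available).
library(tidyr)

p <- data %>%
  filter(!is.na(default)) %>%                 # drop prospective customers
  pivot_longer(c(age, employ, debtinc, creddebt),
               names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = default)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ variable, scales = "free") +   # each variable on its own scale
  labs(title = "Distributions by Default Status")
p
```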
cormat <- cor(data %>%
filter(!is.na(default)) %>% # Filter out prospective customers
select_if(is.numeric)) # Consider only numeric variables
ggcorrplot(cormat, type = "lower", outline.color = "white", lab = TRUE) +
ggtitle("Correlation Heatmap of Numeric Independent Variables")
The correlation heatmap of the numeric independent variables shows modest associations (ranging from -0.03 to 0.63), with none reaching notably high levels. This suggests no severe multicollinearity, supporting the decision to retain all variables for model estimation.
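A more formal multicollinearity check is the variance inflation factor (VIF). A base-R sketch on simulated data (the same computation could be run on the numeric columns of `data`; the names x1..x3 are purely illustrative):

```r
# Sketch: VIF by hand, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors. Simulated data keeps
# the example self-contained.
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n))
X$x2 <- 0.5 * X$x1 + rnorm(n)   # mildly correlated with x1
X$x3 <- rnorm(n)                # independent of the others
vifs <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)  # values near 1 suggest little collinearity; > 10 is a common red flag
```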
data.1 <- data %>%
mutate(
ed_1 = as.numeric(ed == "Did not complete high school"),
ed_2 = as.numeric(ed == "High school degree"),
ed_3 = as.numeric(ed == "Some college"),
ed_4 = as.numeric(ed == "College degree"),
ed_5 = as.numeric(ed == "Post-undergraduate degree"),
default_num = as.numeric(default == "Yes")
)
data.2 <- data.1 %>%
select(-ed, -ed_1, -default) %>% # Drop ed_1 (reference level) to avoid perfect multicollinearity
filter(!is.na(default_num)) # Filter out prospective customers
head(data.2)
## age employ address income debtinc creddebt othdebt ed_2 ed_3 ed_4 ed_5
## 1 41 17 12 176 9.3 11.359392 5.008608 0 1 0 0
## 2 27 10 6 31 17.3 1.362202 4.000798 0 0 0 0
## 3 40 15 14 55 5.5 0.856075 2.168925 0 0 0 0
## 4 41 15 14 120 2.9 2.658720 0.821280 0 0 0 0
## 5 24 2 0 28 17.3 1.787436 3.056564 1 0 0 0
## 6 41 5 5 25 10.2 0.392700 2.157300 1 0 0 0
## default_num
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 0
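The manual dummy construction above, including the drop of ed_1 as the reference level, mirrors what `model.matrix()` does with treatment coding. A small sketch on a toy factor (not the real `ed` column):

```r
# Sketch: model.matrix() builds k-1 treatment-coded dummies, dropping the
# reference level automatically -- equivalent to removing ed_1 by hand.
ed_toy <- factor(c("none", "hs", "college", "none", "hs"),
                 levels = c("none", "hs", "college"))
dummies <- model.matrix(~ ed_toy)[, -1, drop = FALSE]  # drop the intercept column
dummies  # columns ed_toyhs and ed_toycollege; "none" is the reference level
```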
set.seed(123)
indices <- sample(nrow(data.2), size = 0.7*nrow(data.2))
data.train <- data.2[indices, ]
data.test <- data.2[-indices, ]
print(paste0("Size of the training set: ", dim(data.train)[1]))
## [1] "Size of the training set: 489"
print(paste0("Size of the test set: ", dim(data.test)[1]))
## [1] "Size of the test set: 211"
fit.full <- glm(default_num ~ ., family = binomial(), data = data.train)
summary(fit.full)
##
## Call:
## glm(formula = default_num ~ ., family = binomial(), data = data.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.417248 0.696333 -2.035 0.041820 *
## age 0.029300 0.020757 1.412 0.158065
## employ -0.234730 0.038944 -6.027 1.67e-09 ***
## address -0.102565 0.026824 -3.824 0.000132 ***
## income -0.011210 0.007906 -1.418 0.156205
## debtinc 0.052413 0.033743 1.553 0.120359
## creddebt 0.617149 0.129398 4.769 1.85e-06 ***
## othdebt 0.070464 0.087231 0.808 0.419214
## ed_2 0.658946 0.288566 2.284 0.022400 *
## ed_3 0.042296 0.436072 0.097 0.922732
## ed_4 0.251575 0.529975 0.475 0.635006
## ed_5 1.119653 1.303248 0.859 0.390271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.30 on 488 degrees of freedom
## Residual deviance: 397.68 on 477 degrees of freedom
## AIC: 421.68
##
## Number of Fisher Scoring iterations: 6
Based on the current analysis, age, income, othdebt, debtinc, and certain education levels (ed_3, ed_4, ed_5) are not statistically significant at the 0.05 level. However, I believe that debtinc is crucial to a customer's ability to repay a loan. The next logistic regression model is therefore estimated excluding all statistically insignificant variables except debtinc.
fit.reduced <- glm(default_num ~ . - age - income - othdebt - ed_3 - ed_4 - ed_5,
family = binomial(), data = data.train)
summary(fit.reduced)
##
## Call:
## glm(formula = default_num ~ . - age - income - othdebt - ed_3 -
## ed_4 - ed_5, family = binomial(), data = data.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98692 0.31550 -3.128 0.001759 **
## employ -0.21882 0.03323 -6.585 4.54e-11 ***
## address -0.08399 0.02296 -3.659 0.000254 ***
## debtinc 0.08053 0.02136 3.769 0.000164 ***
## creddebt 0.52086 0.09784 5.324 1.02e-07 ***
## ed_2 0.56799 0.26463 2.146 0.031844 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.30 on 488 degrees of freedom
## Residual deviance: 401.84 on 483 degrees of freedom
## AIC: 413.84
##
## Number of Fisher Scoring iterations: 6
print(paste0("AIC for the First Logit Model: ", round(AIC(fit.full), 2)))
## [1] "AIC for the First Logit Model: 421.68"
print(paste0("AIC for the Second Logit Model: ", round(AIC(fit.reduced), 2)))
## [1] "AIC for the Second Logit Model: 413.84"
The Akaike information criterion (AIC) for the reduced logistic regression model is lower, indicating that the model performs better with the exclusion of certain variables. Additionally, in this refined model, all variables demonstrate statistical significance at the 0.05 level.
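Because the reduced model is nested in the full one, a likelihood ratio test complements the AIC comparison; in the report this would be `anova(fit.reduced, fit.full, test = "Chisq")`. A self-contained sketch on simulated data, where x3 is irrelevant by construction and plays the role of the dropped terms:

```r
# Sketch: likelihood ratio test for nested logistic regressions.
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 does not affect y
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))
full    <- glm(y ~ x1 + x2 + x3, family = binomial())
reduced <- glm(y ~ x1 + x2,      family = binomial())
anova(reduced, full, test = "Chisq")  # a large p-value supports the reduced model
```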
X.train <- as.matrix(data.train %>% select(-default_num))
y.train <- data.train$default_num
d.train <- xgb.DMatrix(X.train, label = y.train)
X.test <- as.matrix(data.test %>% select(-default_num))
y.test <- data.test$default_num
params <- list(objective = "binary:logistic", eval_metric = "logloss")
xgb <- xgboost(params = params, data = d.train, nrounds = 50)
## [1] train-logloss:0.547281
## [2] train-logloss:0.459220
## [3] train-logloss:0.400979
## [4] train-logloss:0.353110
## [5] train-logloss:0.313238
## [6] train-logloss:0.289052
## [7] train-logloss:0.261826
## [8] train-logloss:0.239064
## [9] train-logloss:0.224665
## [10] train-logloss:0.214369
## [11] train-logloss:0.206001
## [12] train-logloss:0.198469
## [13] train-logloss:0.182391
## [14] train-logloss:0.173903
## [15] train-logloss:0.165850
## [16] train-logloss:0.160657
## [17] train-logloss:0.153349
## [18] train-logloss:0.145432
## [19] train-logloss:0.139134
## [20] train-logloss:0.133590
## [21] train-logloss:0.130590
## [22] train-logloss:0.125796
## [23] train-logloss:0.121296
## [24] train-logloss:0.119171
## [25] train-logloss:0.116311
## [26] train-logloss:0.110614
## [27] train-logloss:0.104667
## [28] train-logloss:0.102794
## [29] train-logloss:0.098316
## [30] train-logloss:0.093869
## [31] train-logloss:0.090872
## [32] train-logloss:0.087533
## [33] train-logloss:0.086606
## [34] train-logloss:0.082780
## [35] train-logloss:0.079901
## [36] train-logloss:0.078798
## [37] train-logloss:0.077504
## [38] train-logloss:0.074678
## [39] train-logloss:0.072353
## [40] train-logloss:0.070316
## [41] train-logloss:0.067904
## [42] train-logloss:0.065669
## [43] train-logloss:0.063883
## [44] train-logloss:0.062214
## [45] train-logloss:0.060305
## [46] train-logloss:0.058405
## [47] train-logloss:0.056882
## [48] train-logloss:0.055465
## [49] train-logloss:0.054015
## [50] train-logloss:0.053009
objective = "binary:logistic": sets the XGBoost objective function to binary logistic regression, meaning the model is trained for binary classification (0 or 1 outcomes).
eval_metric = "logloss": the evaluation metric used during training; log-loss is a common metric for classification problems.
nrounds = 50: the number of boosting rounds (iterations); the model is trained for 50 rounds.
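A fixed nrounds = 50 risks overfitting, as the steadily falling training log-loss above suggests. A common alternative is early stopping on a held-out set; a sketch on simulated data, assuming the classic `xgb.train()` interface with a `watchlist` argument:

```r
# Sketch: early stopping with xgb.train() on a validation set (simulated
# data; assumes the classic pre-2.0 xgboost interface).
library(xgboost)
set.seed(123)
X <- matrix(rnorm(600 * 4), ncol = 4)
y <- rbinom(600, 1, plogis(X[, 1] - X[, 2]))
tr  <- 1:400
dtr <- xgb.DMatrix(X[tr, ],  label = y[tr])
dva <- xgb.DMatrix(X[-tr, ], label = y[-tr])
fit <- xgb.train(params = list(objective = "binary:logistic",
                               eval_metric = "logloss"),
                 data = dtr, nrounds = 200,
                 watchlist = list(eval = dva),
                 early_stopping_rounds = 10,  # stop after 10 rounds w/o improvement
                 verbose = 0)
fit$best_iteration  # usually far fewer than 200 rounds
```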
lr.pred <- predict(fit.reduced, data.test, type = "response")
xgb.pred <- predict(xgb, X.test)
lr.confmat <- table(true = y.test, pred = round(lr.pred))
xgb.confmat <- table(true = y.test, pred = round(xgb.pred))
print("Logit Model Confusion Matrix:")
## [1] "Logit Model Confusion Matrix:"
print(lr.confmat)
## pred
## true 0 1
## 0 148 9
## 1 31 23
print("XGBoost Confusion Matrix:")
## [1] "XGBoost Confusion Matrix:"
print(xgb.confmat)
## pred
## true 0 1
## 0 147 10
## 1 32 22
calc_accuracy <- function(confmat) {
return(sum(diag(confmat))/sum(confmat))
}
acc <- sapply(list(lr.confmat, xgb.confmat), calc_accuracy)
print(paste0("Accuracy of the Logit Model: ", round(acc[1], 4)))
## [1] "Accuracy of the Logit Model: 0.8104"
print(paste0("Accuracy of the XGBoost Model: ", round(acc[2], 4)))
## [1] "Accuracy of the XGBoost Model: 0.8009"
The Logistic Regression model reaches an accuracy of 0.8104, correctly predicting outcomes for approximately 81.04% of the test instances. The XGBoost model achieves 0.8009, correctly predicting approximately 80.09%, slightly lower than the Logit model.
Given its higher test accuracy and its interpretability, the Logit model is preferred for making predictions on this particular dataset.
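Accuracy alone can mask performance on the minority (default) class. Precision, recall, and F1 for the defaulters, computed from the logit confusion matrix reported above, give a fuller picture:

```r
# Sketch: precision, recall, and F1 for the "default" class, using the
# logit confusion matrix above (rows = truth, columns = prediction).
confmat <- matrix(c(148, 31, 9, 23), nrow = 2,
                  dimnames = list(true = c("0", "1"), pred = c("0", "1")))
precision <- confmat["1", "1"] / sum(confmat[, "1"])  # 23 / 32
recall    <- confmat["1", "1"] / sum(confmat["1", ])  # 23 / 54
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
```

A recall below 0.5 means fewer than half of the actual defaulters are flagged, which may matter more than overall accuracy in a credit-risk setting.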
saveRDS(fit.reduced, "fit.reduced.rds")
AME <- summary(margins(glm(default_num ~ employ + address + debtinc + creddebt + ed_2,
family = binomial(), data = data.train)))
AME
## factor AME SE z p lower upper
## address -0.0112 0.0029 -3.8265 0.0001 -0.0169 -0.0055
## creddebt 0.0693 0.0117 5.9143 0.0000 0.0464 0.0923
## debtinc 0.0107 0.0027 4.0081 0.0001 0.0055 0.0160
## ed_2 0.0756 0.0345 2.1895 0.0286 0.0079 0.1433
## employ -0.0291 0.0037 -7.7916 0.0000 -0.0365 -0.0218
Interpretations:
Each additional year at the current address reduces the probability of default by 0.0112 (1.12 percentage points) on average.
Each additional $1,000 of credit card debt increases the probability of default by 0.0693 (6.93 percentage points) on average.
Each additional percentage point of the debt-to-income ratio increases the probability of default by 0.0107 (1.07 percentage points) on average.
A person with a high school degree has, on average, a 0.0756 (7.56 percentage points) higher probability of default than one who did not complete high school.
Each additional year with the current employer reduces the probability of default by 0.0291 (2.91 percentage points) on average.
All interpretations hold under the ceteris paribus assumption.
oddsratio <- exp(coef(fit.reduced))
oddsratio
## (Intercept) employ address debtinc creddebt ed_2
## 0.3727246 0.8034623 0.9194400 1.0838592 1.6834770 1.7647142
Interpretations:
Each additional year with the current employer reduces the odds of default by 19.65% on average.
Each additional year at the current address reduces the odds of default by 8.06% on average.
Each additional percentage point of the debt-to-income ratio increases the odds of default by 8.39% on average.
Each additional $1,000 of credit card debt increases the odds of default by 68.35% on average.
A person with a high school degree has, on average, 76.47% higher odds of default than one who did not complete high school.
All interpretations hold under the ceteris paribus assumption.
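The percentages above follow from 100 * (exp(beta) - 1) applied to the reduced-model coefficients; a quick arithmetic check:

```r
# Sketch: percent change in the odds of default per unit increase,
# 100 * (exp(beta) - 1), using the fit.reduced coefficients reported above.
beta <- c(employ = -0.21882, address = -0.08399, debtinc = 0.08053,
          creddebt = 0.52086, ed_2 = 0.56799)
round(100 * (exp(beta) - 1), 2)
# employ -19.65, address -8.06, debtinc 8.39, creddebt 68.35, ed_2 76.47
```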
newdata <- data.1 %>%
filter(is.na(default_num)) # Keep only the prospective customers (missing default status)
predict(fit.reduced, newdata, type = "response")
## 1 2 3 4 5 6
## 0.0119050043 0.0550420559 0.5495592856 0.0769904855 0.3137495993 0.5071960731
## 7 8 9 10 11 12
## 0.3087182473 0.8103898499 0.0881504307 0.1099214472 0.0119921106 0.0244792620
## 13 14 15 16 17 18
## 0.0037937538 0.0032832309 0.2808662793 0.3029158979 0.9432149171 0.0267555391
## 19 20 21 22 23 24
## 0.3624804105 0.0197137690 0.2070791793 0.0416102378 0.0954745068 0.0002040433
## 25 26 27 28 29 30
## 0.2409626785 0.1060022436 0.0307582848 0.0897410456 0.0022571181 0.0921404583
## 31 32 33 34 35 36
## 0.1161042725 0.0078190458 0.5902502073 0.0336127178 0.0261221378 0.1390711157
## 37 38 39 40 41 42
## 0.2828656665 0.0870747638 0.4111603583 0.1938147002 0.5324854517 0.0257144808
## 43 44 45 46 47 48
## 0.0015270696 0.0674604814 0.0430344504 0.7650876319 0.1788327110 0.0049805739
## 49 50 51 52 53 54
## 0.1062297928 0.0024293667 0.0023051076 0.2250349489 0.0712377423 0.0072830792
## 55 56 57 58 59 60
## 0.3662047979 0.0227113025 0.6217402410 0.4022671678 0.3521076868 0.4094913486
## 61 62 63 64 65 66
## 0.0525038408 0.1630601778 0.0867669771 0.8496705865 0.0695080132 0.0357077939
## 67 68 69 70 71 72
## 0.1375358435 0.1048104095 0.0660165321 0.1639280598 0.6656053215 0.0116275198
## 73 74 75 76 77 78
## 0.2616445599 0.6424549859 0.0001744932 0.3849093400 0.4614931960 0.6038249402
## 79 80 81 82 83 84
## 0.1688422865 0.0013299229 0.1070778100 0.5997702963 0.3949930049 0.4808007995
## 85 86 87 88 89 90
## 0.7443479758 0.1837060177 0.0058206773 0.3568944160 0.0478891540 0.7749034519
## 91 92 93 94 95 96
## 0.0779068464 0.2784830770 0.0056742890 0.3254708058 0.0214679766 0.0341424624
## 97 98 99 100 101 102
## 0.3764425042 0.4697291368 0.0025570769 0.3889150449 0.3125489935 0.7886344646
## 103 104 105 106 107 108
## 0.3175598435 0.9906053912 0.1100970153 0.1106981901 0.0224590251 0.8941601132
## 109 110 111 112 113 114
## 0.0609606307 0.1251843631 0.9853058673 0.0128988337 0.0443550470 0.2387587336
## 115 116 117 118 119 120
## 0.0220276772 0.3303856563 0.1658973002 0.2242050726 0.3730342845 0.2304217285
## 121 122 123 124 125 126
## 0.2872739246 0.1544030297 0.6170547161 0.1703493812 0.2018713659 0.5705938427
## 127 128 129 130 131 132
## 0.7472585411 0.0118394777 0.1491414865 0.0297557238 0.0210205280 0.0171486296
## 133 134 135 136 137 138
## 0.0129913424 0.1153478148 0.1170132762 0.0125820129 0.9430188128 0.1222569862
## 139 140 141 142 143 144
## 0.0105773685 0.2074632426 0.9914460841 0.0494227745 0.0095099863 0.2127732915
## 145 146 147 148 149 150
## 0.1536838185 0.0106594749 0.1958628444 0.0290399748 0.3282543609 0.0070464700
My investigation into predicting customer default in the banking sector has provided valuable insights. Visual exploration initially suggested links between lower age, shorter employment duration, and a higher likelihood of default, with a lower debt-to-income ratio as a mitigating factor. Statistical analysis confirmed the importance of employment duration and the debt-to-income ratio, showed that age was not significant once other variables were controlled for, and highlighted the roles of credit card debt, years at the current address, and education level.
This underscores the shift from visual intuition to statistical rigor: data visualization provides initial insights, but statistical analysis is essential for navigating the complexities of predictive modeling.